FOIL it! Find One mismatch between Image and Language caption

نویسندگان

  • Ravi Shekhar
  • Sandro Pezzelle
  • Yauhen Klimovich
  • Aurélie Herbelot
  • Moin Nabi
  • Enver Sangineto
  • Raffaella Bernardi
چکیده

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MSCOCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

بررسی تأثیر نمایه‌سازی مفهوم-محور تصاویر بر بازیابی آن‌ها با استفاده از موتور جستجوی گوگل

Purpose: The purpose of the present study is to investigate the Impact of Concept-based Image Indexing on Image Retrieval via Google. Due to the importance of images, this article focuses on the features taken into account by Google in retrieving the images. Methodology: The present study is a type of applied research, and the research method used in it comes from quasi-experimental and techno...

متن کامل

Effects of Closed-caption Programs on EFL Learners’ Listening Comprehension and Vocabulary Learning

This study aimed at investigating the impact of closed-caption program on listening comprehension of English movies and vocabulary learning. Sixty-four graduate students studying at Shiraz Islamic Azad University were selected as the participants of the study. The participants were divided into two groups: experimental group (with closed caption program) and control group (without closed captio...

متن کامل

Studying the Effectiveness of Security Images in Internet Banking

Security images are often used as part of the login process on internet banking websites, under the theory that they can help foil phishing attacks. Previous studies, however, have yielded inconsistent results about users’ ability to notice that a security image is missing and their willingness to log in even when the expected security image is absent. This paper describes an online study of 48...

متن کامل

Leveraging Visual Question Answering for Image-Caption Ranking

Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a “feature extraction” module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact...

متن کامل

Names and Faces

We show that a large and realistic face dataset can be built from news photographs and their associated captions. Our dataset consists of 44,773 face images, obtained by applying a face finder to approximately half a million captioned news images. This dataset is more realistic than usual face recognition datasets, because it contains faces captured “in the wild” in a variety of configurations ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017